Exploratory Analysis of Contributions to 2016 Presidential Candidates Dataset by Patrick Kennedy

Univariate Plots Section

##  [1] "cmte_id"           "cand_id"           "cand_nm"          
##  [4] "contbr_nm"         "contbr_city"       "contbr_st"        
##  [7] "contbr_zip"        "contbr_zip_5"      "contbr_employer"  
## [10] "contbr_occupation" "contb_receipt_amt" "contb_receipt_dt" 
## [13] "receipt_desc"      "memo_cd"           "memo_text"        
## [16] "form_tp"           "file_num"          "tran_id"          
## [19] "election_tp"
## cont$cand_nm: Bush, Jeb
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -5400     100     500    1167    2700   15000 
## -------------------------------------------------------- 
## cont$cand_nm: Carson, Benjamin S.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -7100.0    25.0    50.0   153.3   100.0 10000.0 
## -------------------------------------------------------- 
## cont$cand_nm: Christie, Christopher J.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -2700     100    1350    1444    2700    5400 
## -------------------------------------------------------- 
## cont$cand_nm: Clinton, Hillary Rodham
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -5400.0    25.0    51.0   438.6   250.0  5400.0 
## -------------------------------------------------------- 
## cont$cand_nm: Cruz, Rafael Edward 'Ted'
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -5450      35      50     251     100   10800 
## -------------------------------------------------------- 
## cont$cand_nm: Fiorina, Carly
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2700.0    50.0   100.0   259.8   250.0  5000.0 
## -------------------------------------------------------- 
## cont$cand_nm: Graham, Lindsey O.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2700.0   200.0   500.0   941.3  1500.0  5400.0 
## -------------------------------------------------------- 
## cont$cand_nm: Huckabee, Mike
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2700.0    43.0   100.0   452.4   500.0  5400.0 
## -------------------------------------------------------- 
## cont$cand_nm: Jindal, Bobby
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2700.0   100.0   250.0   778.6  1375.0  2700.0 
## -------------------------------------------------------- 
## cont$cand_nm: Kasich, John R.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1000.0   100.0   250.0   835.2  1000.0  2700.0 
## -------------------------------------------------------- 
## cont$cand_nm: Lessig, Lawrence
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   27.94  137.50  250.00  633.70  500.00 2700.00 
## -------------------------------------------------------- 
## cont$cand_nm: O'Malley, Martin Joseph
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2700.0   100.0   250.0   656.2   750.0  5400.0 
## -------------------------------------------------------- 
## cont$cand_nm: Pataki, George E.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   250.0   250.0  1000.0   666.7  1000.0  1000.0 
## -------------------------------------------------------- 
## cont$cand_nm: Paul, Rand
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2700.0    25.0    50.0   234.6   201.6  5000.0 
## -------------------------------------------------------- 
## cont$cand_nm: Perry, James R. (Rick)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -2700     250    1000    1305    2700    2700 
## -------------------------------------------------------- 
## cont$cand_nm: Rubio, Marco
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -5400.0    50.0   100.0   472.8   500.0 10800.0 
## -------------------------------------------------------- 
## cont$cand_nm: Sanders, Bernard
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -3000      10      35      69      50    5000 
## -------------------------------------------------------- 
## cont$cand_nm: Santorum, Richard J.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -2700     100     500    1066    2700    5400 
## -------------------------------------------------------- 
## cont$cand_nm: Stein, Jill
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    20.0   103.0   250.0   186.7   250.0   250.0 
## -------------------------------------------------------- 
## cont$cand_nm: Trump, Donald J.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2700.0   100.0   250.0   352.7   276.1  5400.0 
## -------------------------------------------------------- 
## cont$cand_nm: Walker, Scott
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2700.0   250.0   500.0   811.6  1000.0  5400.0 
## -------------------------------------------------------- 
## cont$cand_nm: Webb, James Henry Jr.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    20.0   100.0   500.0   545.8   500.0  2700.0
## cont$cand_nm: Bush, Jeb
## [1] 1166.82
## -------------------------------------------------------- 
## cont$cand_nm: Carson, Benjamin S.
## [1] 153.3212
## -------------------------------------------------------- 
## cont$cand_nm: Christie, Christopher J.
## [1] 1444.185
## -------------------------------------------------------- 
## cont$cand_nm: Clinton, Hillary Rodham
## [1] 438.6258
## -------------------------------------------------------- 
## cont$cand_nm: Cruz, Rafael Edward 'Ted'
## [1] 250.9629
## -------------------------------------------------------- 
## cont$cand_nm: Fiorina, Carly
## [1] 259.7782
## -------------------------------------------------------- 
## cont$cand_nm: Graham, Lindsey O.
## [1] 941.3373
## -------------------------------------------------------- 
## cont$cand_nm: Huckabee, Mike
## [1] 452.397
## -------------------------------------------------------- 
## cont$cand_nm: Jindal, Bobby
## [1] 778.6207
## -------------------------------------------------------- 
## cont$cand_nm: Kasich, John R.
## [1] 835.159
## -------------------------------------------------------- 
## cont$cand_nm: Lessig, Lawrence
## [1] 633.6856
## -------------------------------------------------------- 
## cont$cand_nm: O'Malley, Martin Joseph
## [1] 656.2477
## -------------------------------------------------------- 
## cont$cand_nm: Pataki, George E.
## [1] 666.6667
## -------------------------------------------------------- 
## cont$cand_nm: Paul, Rand
## [1] 234.6485
## -------------------------------------------------------- 
## cont$cand_nm: Perry, James R. (Rick)
## [1] 1305.47
## -------------------------------------------------------- 
## cont$cand_nm: Rubio, Marco
## [1] 472.8235
## -------------------------------------------------------- 
## cont$cand_nm: Sanders, Bernard
## [1] 69.00485
## -------------------------------------------------------- 
## cont$cand_nm: Santorum, Richard J.
## [1] 1065.83
## -------------------------------------------------------- 
## cont$cand_nm: Stein, Jill
## [1] 186.7143
## -------------------------------------------------------- 
## cont$cand_nm: Trump, Donald J.
## [1] 352.6604
## -------------------------------------------------------- 
## cont$cand_nm: Walker, Scott
## [1] 811.5589
## -------------------------------------------------------- 
## cont$cand_nm: Webb, James Henry Jr.
## [1] 545.8064
## [1] 124763     19

## Warning: Removed 9072 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

## Warning: Removed 15094 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

Univariate Analysis

What is the structure of your dataset?

dim(cont)
## [1] 124763     19

What is/are the main feature(s) of interest in your dataset?

The size of contributions and the total number of contributions per candidate.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I expect the contributor's city and zipcode to support the investigation, since my hypothesis is that cities, and even individual zipcodes, have distinct political profiles.

Did you create any new variables from existing variables in the dataset?

I created a few filters to make the data easier to visualize. Specifically, I created a list of the leading candidates in the overall race and a list of the top 5 and bottom 5 cities in Texas by population. I then created a new dataframe filtered by these lists for the analyses in the Bivariate and Multivariate sections below.
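The filtering step described above could look roughly like the following sketch. The toy data frame, the contents of `leading_candidates` and `selected_cities`, and the upper-case city spelling are all assumptions for illustration; the write-up does not list the actual filter values.

```r
# Toy stand-in for the contributions data frame (the real `cont` has 19 columns).
cont <- data.frame(
  cand_nm = c("Cruz, Rafael Edward 'Ted'", "Sanders, Bernard",
              "Stein, Jill", "Clinton, Hillary Rodham"),
  contbr_city = c("HOUSTON", "AUSTIN", "HOUSTON", "SMALLTOWN"),
  contb_receipt_amt = c(50, 35, 250, 25),
  stringsAsFactors = FALSE
)

# Hypothetical filter lists; the actual lists are not given in the write-up.
leading_candidates <- c("Cruz, Rafael Edward 'Ted'", "Sanders, Bernard",
                        "Clinton, Hillary Rodham")
selected_cities <- c("HOUSTON", "AUSTIN")

# Keep only rows that match both the candidate list and the city list.
cont_subset <- subset(cont,
                      cand_nm %in% leading_candidates &
                        contbr_city %in% selected_cities)
print(cont_subset)
```

Keeping the filter values in named vectors makes it easy to reuse the same lists across several plots.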

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The contribution amounts clustered at particular denominations (e.g. $10, $25, $50, $100) rather than forming a smooth distribution. In addition, there were a number of negative values. Upon further investigation into the dataset, the negative contributions were due to contributors reattributing the donated amount to a spouse's name, reallocating the donation to the general party, or changing the donated amount.
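A quick way to check both observations is to count the negative amounts and tabulate the most frequent positive ones. This is a minimal sketch on made-up amounts; on the real data the vector would presumably be `cont$contb_receipt_amt`.

```r
# Toy amounts; the values are illustrative only, not taken from the dataset.
amt <- c(25, 50, 50, 100, -2700, 25, 10, 2700, 50, -50)

# Count refunds and reattributions, which are recorded as negative amounts.
n_negative <- sum(amt < 0)

# Tabulate the positive amounts to see which denominations dominate.
positive <- amt[amt > 0]
top_amounts <- sort(table(positive), decreasing = TRUE)

print(n_negative)
print(head(top_amounts, 3))
```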

Given the high number of candidates, I subsetted the data to the leading candidates and plan to use that subset for more granular analyses, e.g. contributions by zipcode.

Bivariate Plots Section

## Warning: Removed 2550 rows containing non-finite values (stat_boxplot).
## Warning: Removed 2550 rows containing missing values (geom_point).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

For most candidates the median donated amount fell between $50 and $500. Chris Christie showed the highest median contribution at $1350, while Ben Carson, Hillary Clinton, Ted Cruz, Rand Paul, and Bernie Sanders had median donated amounts of about $50 or less. The medians alone can be misleading, because the candidates with the most contributions in Texas are Ted Cruz (61k), Ben Carson (19k), Bernie Sanders (14k) and Hillary Clinton (13k), while the remaining candidates have fewer than 5000 contributions each.
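To read medians and contribution counts side by side, one option is to build a small per-candidate table. The sketch below uses toy data with placeholder candidates "A" and "B"; only the column names `cand_nm` and `contb_receipt_amt` come from the dataset.

```r
# Toy data: candidate "A" has many small contributions, "B" a few large ones.
cont <- data.frame(
  cand_nm = c("A", "A", "A", "A", "B", "B"),
  contb_receipt_amt = c(25, 50, 50, 100, 1000, 2700),
  stringsAsFactors = FALSE
)

# Median amount and number of contributions per candidate.
med <- tapply(cont$contb_receipt_amt, cont$cand_nm, median)
n <- table(cont$cand_nm)

by_candidate <- data.frame(
  candidate = names(med),
  median_amt = as.numeric(med),
  n_contributions = as.integer(n[names(med)])
)
print(by_candidate)
```

Here "A" has the lower median but twice as many contributions, which is the same pattern the text describes for Ted Cruz.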

Comparing the number of contributions with the median contribution clarifies the picture: although Ted Cruz's donations have a low median dollar amount, he has by far the largest number of contributions. This raises the question of how much each candidate has raised thus far. Not surprisingly, Ted Cruz has raised over $15MM in his home state, with Hillary Clinton a distant second at roughly $6MM. Most candidates did not raise more than $1MM, including Bernie Sanders, who as noted above has roughly 14000 contributions, slightly more than Hillary Clinton and behind Ben Carson and Ted Cruz.
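As a rough consistency check, the totals can be approximated by multiplying the contribution counts quoted above by the per-candidate means from the summary output. The counts are rounded to the nearest thousand in the text, so the results are estimates, not the exact totals.

```r
# Approximate contribution counts quoted in the text, and the
# per-candidate mean amounts from the summary output in the Univariate section.
n_contrib <- c(Cruz = 61000, Clinton = 13000, Sanders = 14000)
mean_amt  <- c(Cruz = 250.9629, Clinton = 438.6258, Sanders = 69.00485)

# Estimated total raised, in millions of dollars.
est_total_mm <- n_contrib * mean_amt / 1e6
print(round(est_total_mm, 2))
```

This gives roughly $15.3MM for Cruz, $5.7MM for Clinton, and just under $1MM for Sanders, in line with the figures stated above.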

Further investigation shows that Bernie Sanders has the lowest contribution-per-contributor ratio, followed by Ben Carson. Jeb Bush, Chris Christie, Rick Perry and Rick Santorum all have contribution-per-contributor ratios greater than $1000.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

A couple of interesting findings were that 1) Donald Trump's fundraising is almost nonexistent in Texas and 2) Bernie Sanders's fundraising in Texas is greater than I imagined it would be.

What was the strongest relationship you found?

The clearest relationships are in the number of contributions by city (the bigger the city, the more contributions) and the number of contributions by candidate (with Ted Cruz gathering the most).

Multivariate Plots Section

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

By looking at just the top 5 and bottom 5 cities in Texas by population, and looking at the current leading candidates, some interesting trends emerge. For one, it is quite clear that Ted Cruz dominates the contribution scene across Texas cities.

Were there any interesting or surprising interactions between features?

While Austin pulls more Bernie Sanders and Hillary Clinton contributions than other cities, Houston has a sizable number of Bernie Sanders contributions, second only to Austin. Not surprisingly, the size of a city certainly seems to affect the number and size of contributions. Looking deeper into the mix of zipcodes within cities, it becomes clear that contributions depend on the particular zipcode. Houston's 77006, for instance, has the largest number of Bernie Sanders contributions. That zipcode's highlights are two prominent art collections and the University of St. Thomas, and its mean age is between 25 and 30. Contrast that with zipcode 77024, outside the inner loop of Houston proper, which has a mean age between 45 and 55.
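A zipcode-level comparison like the one above can be sketched as a cross-tabulation of zipcode by candidate within one city. The rows below are invented for illustration and do not reflect the real counts; the column names match the dataset, and the upper-case city spelling is an assumption.

```r
# Toy rows; the counts are illustrative and do not reflect the real data.
cont <- data.frame(
  contbr_city = c("HOUSTON", "HOUSTON", "HOUSTON", "HOUSTON", "AUSTIN"),
  contbr_zip_5 = c("77006", "77006", "77024", "77024", "78701"),
  cand_nm = c("Sanders, Bernard", "Sanders, Bernard",
              "Cruz, Rafael Edward 'Ted'", "Sanders, Bernard",
              "Sanders, Bernard"),
  stringsAsFactors = FALSE
)

# Restrict to one city, then count contributions per zipcode and candidate.
houston <- cont[cont$contbr_city == "HOUSTON", ]
zip_by_cand <- table(houston$contbr_zip_5, houston$cand_nm)
print(zip_by_cand)
```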

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

Plot One illustrates the number of contributions to 2016 political candidates in the state of Texas. Ted Cruz garners roughly half of all contributions at over 60000, more than four times as many as either Bernie Sanders or Hillary Clinton.

Plot Two

Description Two

Plot Two expands upon Plot One by illustrating contributions to 2016 political candidates in the state of Texas by the five most populated and five least populated cities in the state. The number of candidates was reduced to the current leading candidates in the race.

Plot Three

Description Three

Plot Three expands upon Plot Two by investigating contributions to leading candidates made in the city of Houston by zipcode.


Reflection

Ted Cruz dominates Texas. He garners a staggering number of contributions across the state, his home turf. However, his influence is not without some challenge. While Houston favors him more strongly than any other Texas city, Houston’s second and third largest contribution bases are Hillary Clinton and Bernie Sanders, respectively. Additionally, Austin holds the second largest number of contributors in Texas and Bernie Sanders and Hillary Clinton have more contributors in the city than Ted Cruz. But who is contributing? Looking deeper into Houston zipcodes, a split is seen. While Ted Cruz still is a significant player, zipcodes with younger demographics yield higher Bernie Sanders and Hillary Clinton support.

While the macro view shows a domination by Ted Cruz in terms of his contributing base, his lead is driven primarily by two cities (Houston and San Antonio) and even within Houston there are signs that a sizable percentage of youth veers toward Democratic candidates.

And in all of this, where is Trump? Nowhere in any meaningful way. Additionally, Marco Rubio is not well capitalized. Is this due to Cruz's assumed dominance, given that he represents the state as a Senator? The answer is unclear at this point. This analysis does suggest, however, that Ted Cruz's dominance is not as solid as it initially appeared to be.

It is interesting to be able to gather quite narrow and specific data about the contributors to political campaigns (and rightly so). Additional analyses of occupation and political contributions could add richness to the analysis. Incorporating demographics directly into the dataset, joining on zipcodes or even street names, could also add a layer of detail. This could ultimately form the basis for a predictive model in which contribution levels can be forecast with greater accuracy, resulting in better information for both contributors and the parties that receive those contributions.
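The zipcode join suggested above might be sketched as follows. The `demographics` table, its `zip` and `median_age` columns, and all values shown are hypothetical; no such table is part of this analysis.

```r
# Toy per-zipcode contribution counts (illustrative values).
contrib_by_zip <- data.frame(
  contbr_zip_5 = c("77006", "77024", "77002"),
  n_contributions = c(3, 5, 2),
  stringsAsFactors = FALSE
)

# Hypothetical demographics table keyed by zipcode.
demographics <- data.frame(
  zip = c("77006", "77024"),
  median_age = c(28, 50),
  stringsAsFactors = FALSE
)

# Left join: keep every zipcode with contributions, even without demographics.
joined <- merge(contrib_by_zip, demographics,
                by.x = "contbr_zip_5", by.y = "zip", all.x = TRUE)
print(joined)
```

Using `all.x = TRUE` keeps zipcodes that have no demographic match, so gaps in the external data show up as `NA` rather than silently dropping contributions.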